Abstract

Genome-wide association studies are based on the linkage disequilibrium pattern between common tagging 
single-nucleotide polymorphisms (SNPs) (i.e., SNPs having only common alleles) and true causal variants, and 
association studies with rare SNP alleles aim to detect rare causal variants. To better understand and explain the 
findings from both types of studies and to provide clues to improve the power of an association study with only 
common SNPs genotyped, we study the correlation between common SNPs and the presence of rare alleles 
within a region in the genome and look at the capability of common SNPs in strong linkage disequilibrium with 
each other to capture single rare alleles. Our results indicate that common SNPs can, to some extent, tag the 
presence of rare alleles and that including SNPs in strong linkage disequilibrium with each other among the 
tagging SNPs helps to detect rare alleles. 



Background 

In recent years, genome-wide association studies have 
identified hundreds of genetic variants that may be asso- 
ciated with many common diseases [1-3]. It is believed 
that the associated single-nucleotide polymorphisms 
(SNPs) detected from current association studies may 
represent linkage disequilibrium (LD) between a com- 
mon tagging SNP and true causal variants. Under the 
common disease/rare variants hypothesis, which sug- 
gests that many rare variants can contribute to the phe- 
notypic variation [4,5], association studies to detect rare 
alleles have become more and more important. In this 
study, we try to answer two questions: (1) Within a 
region in the genome, how well do common SNPs tag 
the presence of rare alleles? (2) When selecting common 
tagging SNPs for association studies to detect rare 
alleles, should we exclude SNPs in strong LD with each
other (r^2 > 0.95), or does it help to capture more information
on the rare alleles if we include tagging SNPs in
strong LD (r^2 > 0.95) with each other? To answer the
first question, we analyzed the correlation between com- 
mon SNPs and the number of rare alleles in samples of 
rare SNPs (i.e., SNPs containing rare alleles) in each 
region of the chromosomes. Then, for the second 
question, we studied the change in correlation between
a single rare SNP and common tagging SNPs that is 
achieved by including SNPs in strong LD with each 
other when selecting common tagging SNPs. 

Methods 

Sample 

In this study we use the Genetic Analysis Workshop 17 (GAW17)
data set, which is composed of 697 individuals. The data include
24,487 SNPs, 74% (18,131) of which are considered rare SNPs with
a minor allele frequency (MAF) less than 0.01 and only 12.8% of
which are common SNPs with MAF > 0.05. Because the numbers of
rare and common SNPs in the data are so unbalanced, and in order
to study the capability of the common SNPs to tag rare variants,
we incorporate into this data set genotype data from the
International HapMap Project, release 28
(http://hapmap.ncbi.nlm.nih.gov/). The final data set includes
627 individuals from 7 populations: European (88), Chinese (91),
Chinese in Denver (90), Japanese (92), Luhya (98), Tuscan (61),
and Yoruba (107). After removal of SNPs in perfect LD, we are
left with 13,777 rare SNPs (MAF < 0.01) and 116,944 common SNPs
(MAF > 0.05).
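
The MAF thresholds used above can be illustrated with a minimal sketch. It assumes each SNP is stored as a list of allele counts (0, 1, or 2) per individual; the function names and the "intermediate" label for SNPs with 0.01 <= MAF <= 0.05 are hypothetical and are not taken from the study.

```python
def minor_allele_frequency(genotypes):
    """MAF of one SNP, given 0/1/2 counts of one allele per individual."""
    freq = sum(genotypes) / (2 * len(genotypes))
    return min(freq, 1.0 - freq)

def classify_snp(genotypes, rare_below=0.01, common_above=0.05):
    """Label a SNP with the thresholds quoted in the text (MAF < 0.01, MAF > 0.05)."""
    maf = minor_allele_frequency(genotypes)
    if maf < rare_below:
        return "rare"
    if maf > common_above:
        return "common"
    # Hypothetical label: SNPs between the two thresholds fall in neither class.
    return "intermediate"
```

Note that a monomorphic SNP has MAF 0 and would be labeled "rare" by this sketch, so such SNPs would need to be filtered out beforehand.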
Correlation between common SNPs and the presence of
rare alleles 

We divide the genome into nonoverlapping 1-Mb bins. For each bin,
we separate the rare SNPs from the common SNPs. The value of a
common SNP for each individual is the number of minor alleles that
the individual carries. The correlation between the set of common
SNPs and the number of rare alleles is calculated in each bin as
follows. For n_s randomly selected rare SNPs in a bin (here we
use n_s = 5), we quantify the number of rare alleles as the total
number of rare alleles, y_i, that individual i (i = 1, 2, ..., N)
carries. The correlation between the variable y_i and the common
SNPs in the bin is calculated over the N individuals in two ways.
In the first, we calculate the Pearson correlation r between y_i
and each of the common SNPs and take the maximum r^2. In the
second, we calculate the multiple correlation R^2 [6] between y_i
and the common SNPs using a multiple regression model. These two
correlations are calculated for each consecutive bin across the
whole genome. If n_r > n_s, where n_r is the total number of rare
SNPs in a bin, we repeat the random sampling of the rare SNPs and
the calculation of the correlation n_r/n_s times (rounded to the
closest integer).
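
The first of the two correlation measures can be sketched as follows. This is an illustration under stated assumptions, not the code used in the study: genotypes are lists of 0/1/2 counts per individual, all function names are hypothetical, and bins with n_r <= n_s are assumed to receive a single draw that uses every rare SNP (the text does not say how such bins are handled).

```python
import random

def pearson_r2(x, y):
    """Squared Pearson correlation; 0.0 when either vector is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    return sxy * sxy / (sxx * syy)

def rare_allele_count(rare_snps, chosen):
    """y_i: rare alleles carried by individual i, summed over the chosen rare SNPs."""
    n_ind = len(rare_snps[0])
    return [sum(rare_snps[j][i] for j in chosen) for i in range(n_ind)]

def max_r2_per_draw(common_snps, rare_snps, n_s=5, rng=None):
    """One maximum r^2 per random draw of n_s rare SNPs in a single bin."""
    rng = rng or random.Random(0)
    n_r = len(rare_snps)
    # Assumption: bins with n_r <= n_s get a single draw that uses every rare SNP.
    n_draws = round(n_r / n_s) if n_r > n_s else 1
    results = []
    for _ in range(n_draws):
        chosen = rng.sample(range(n_r), min(n_s, n_r))
        y = rare_allele_count(rare_snps, chosen)
        results.append(max(pearson_r2(snp, y) for snp in common_snps))
    return results
```

Calling `max_r2_per_draw` once per 1-Mb bin and pooling the returned values would give a distribution of maximum r^2 values of the kind the text describes.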

We calculate the correlations between the common 
SNPs and the number of rare alleles in rare SNPs sepa- 
rately in each of the seven subpopulations, to test 
whether the tagging capability is different in different 
populations. We also calculate the correlation between 
common SNPs and the number of each of two types of 
rare alleles (synonymous and nonsynonymous) to test 
whether common SNPs have a different capability to tag 
these two types of rare alleles. 

To examine whether the correlations between common SNPs and rare
alleles are due to statistical noise, we perform a permutation
test. We permute each of the common SNPs within the bin across
individuals and calculate the correlations between the variable
y_i and the permuted common SNPs. Then the observed and
permutation correlation distributions are compared using a
Kolmogorov-Smirnov test. We also compare the means of the two
distributions using a t test.
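
A minimal sketch of the permutation step and of the two-sample Kolmogorov-Smirnov statistic follows. It assumes that each common SNP is shuffled independently of the others, which the text does not state explicitly, and it computes only the statistic D; a P value would normally come from a statistics library. The function names are hypothetical.

```python
import random

def permute_each_snp(common_snps, rng):
    """Shuffle every common SNP across individuals, independently of the others."""
    permuted = []
    for snp in common_snps:
        shuffled = list(snp)  # copy, so the observed genotypes are left untouched
        rng.shuffle(shuffled)
        permuted.append(shuffled)
    return permuted

def ks_statistic(a, b):
    """Two-sample Kolmogorov-Smirnov statistic D = max |F_a(x) - F_b(x)|."""
    a, b = sorted(a), sorted(b)
    i = j = 0
    d = 0.0
    while i < len(a) and j < len(b):
        x = min(a[i], b[j])
        # Advance both empirical distribution functions past x (handles ties).
        while i < len(a) and a[i] <= x:
            i += 1
        while j < len(b) and b[j] <= x:
            j += 1
        d = max(d, abs(i / len(a) - j / len(b)))
    return d
```

Here `a` would hold the observed correlations across bins and `b` the correlations obtained after permutation.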

Capability of common SNPs in strong LD to 
capture rare alleles 

We hypothesize that incorporating common SNPs in 
strong LD will capture significantly more variation 
resulting from rare alleles than using only the common 
SNPs in less strong LD with each other. We select the 
common SNPs within a 1-Mb region of each rare SNP 
and divide them into two sets. The first set is composed 
of the common tagging SNPs with LD of r^2 < 0.80 between each
pair; we call this set A. The second set is composed of the
common SNPs with LD of r^2 < 0.95 between each pair, which we
call set B. So set B has two parts: all the SNPs in set A
(r^2 < 0.80) and those SNPs in set (B - A) that are in higher LD
with the SNPs in set A or between themselves (0.80 ≤ r^2 < 0.95).
Any SNP in perfect LD (r^2 = 1) with another is excluded from the
data. Then we calculate the multiple correlations R^2 [6] between
each rare SNP and the set of common SNPs (R^2_A for set A and
R^2_B for set B). Because R^2 never decreases when the number of
independent variables in the model increases, R^2_B is always
greater than or equal to R^2_A [6], where the subscripts A and B
represent set A and set B, respectively.
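
One way to build nested sets A and B with the stated pairwise thresholds is a greedy pass over the SNPs, sketched below. The text does not say how the sets were actually selected, so the greedy rule, the use of the squared genotype correlation as the LD measure, and all function names are assumptions made for illustration.

```python
def pearson_r2(x, y):
    """Squared Pearson correlation; 0.0 when either vector is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    return sxy * sxy / (sxx * syy)

def ld_prune(snps, threshold, seed=()):
    """Greedily keep SNP indices whose r^2 with every kept SNP is below threshold."""
    kept = list(seed)
    for idx, snp in enumerate(snps):
        if idx in kept:
            continue
        if all(pearson_r2(snp, snps[k]) < threshold for k in kept):
            kept.append(idx)
    return kept

def build_sets(snps, a_threshold=0.80, b_threshold=0.95):
    """Return index lists for set A and set B, with set A contained in set B."""
    set_a = ld_prune(snps, a_threshold)
    # Seeding with set A guarantees that set B contains every SNP of set A.
    set_b = ld_prune(snps, b_threshold, seed=set_a)
    return set_a, set_b
```

The indices in `set_b` that are absent from `set_a` correspond to set (B - A).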
An F statistic,

F = \frac{(R^2_B - R^2_A)/(n_B - n_A)}{(1 - R^2_B)/(N - n_B - 1)},    (1)

where n_A and n_B are the numbers of SNPs in set A and set B,
respectively, is calculated to test whether the increase in R^2_B
over R^2_A to predict the rare alleles is significant. Because
R^2 increases with the number of explanatory terms in a model, we
use the adjusted R^2 (R^2_adj), which adjusts for the number of
explanatory common SNPs in the multiple regression model [6], to
evaluate the multiple correlation:

R^2_{adj} = 1 - (1 - R^2) \frac{N - 1}{N - n - 1},    (2)

where n is n_A or n_B.
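
Equations (1) and (2) translate directly into two small functions. The sketch below only restates the formulas; the function and argument names are hypothetical, and the numbers in the usage note are made up for illustration.

```python
def adjusted_r2(r2, n_ind, n_snps):
    """Equation (2): R^2 adjusted for the number of explanatory SNPs."""
    return 1.0 - (1.0 - r2) * (n_ind - 1) / (n_ind - n_snps - 1)

def f_statistic(r2_a, r2_b, n_a, n_b, n_ind):
    """Equation (1): F statistic for the gain of R^2_B over R^2_A."""
    numerator = (r2_b - r2_a) / (n_b - n_a)
    denominator = (1.0 - r2_b) / (n_ind - n_b - 1)
    return numerator / denominator
```

For example, with N = 91, n_A = 10, n_B = 15, R^2_A = 0.10, and R^2_B = 0.20, the F statistic is 1.875, while the adjusted values are -0.0125 for set A and 0.04 for set B. The adjustment can therefore push a small R^2 below zero.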

In order to test whether the increase in R^2 is due to the
stronger LD among the SNPs in set B, which comes from the SNPs in
set (B - A), or due to the larger number of SNPs from set (B - A),
we evaluate the significance of the F statistic by comparison to
a sample of 1,000 replicates of its permutation distribution,
obtained by permuting across individuals the set of SNPs in set B
but not in set A (i.e., the SNPs in set (B - A)), which breaks any
LD structure between sets A and (B - A) but keeps the structure
within the set (B - A).
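
The permutation scheme can be sketched as below, under several assumptions: R^2 is obtained from an ordinary least-squares fit solved through the normal equations, the SNP columns are not collinear, the rare SNP is not constant, and the empirical P value uses the (count + 1)/(replicates + 1) convention. None of these implementation details, nor the function names, come from the text.

```python
import random

def r_squared(y, columns):
    """R^2 of the least-squares fit of y on an intercept plus the given columns."""
    n = len(y)
    rows = [[1.0] + [col[i] for col in columns] for i in range(n)]
    p = len(rows[0])
    # Normal equations (X'X) beta = X'y, solved by Gauss-Jordan elimination.
    aug = [[sum(r[a] * r[b] for r in rows) for b in range(p)]
           + [sum(r[a] * yi for r, yi in zip(rows, y))] for a in range(p)]
    for c in range(p):
        pivot = max(range(c, p), key=lambda k: abs(aug[k][c]))
        aug[c], aug[pivot] = aug[pivot], aug[c]
        for k in range(p):
            if k != c:
                factor = aug[k][c] / aug[c][c]
                aug[k] = [u - factor * v for u, v in zip(aug[k], aug[c])]
    beta = [aug[a][p] / aug[a][a] for a in range(p)]
    fitted = [sum(b * v for b, v in zip(beta, r)) for r in rows]
    mean_y = sum(y) / n
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot

def f_statistic(r2_a, r2_b, n_a, n_b, n_ind):
    """F statistic for the gain of R^2_B over R^2_A."""
    return ((r2_b - r2_a) / (n_b - n_a)) / ((1.0 - r2_b) / (n_ind - n_b - 1))

def permutation_p_value(y, set_a, extra, n_perm=1000, rng=None):
    """Observed F and its permutation P value; `extra` holds the SNPs in (B - A)."""
    rng = rng or random.Random(0)
    n = len(y)
    r2_a = r_squared(y, set_a)

    def f_for(extra_cols):
        r2_b = r_squared(y, set_a + extra_cols)
        return f_statistic(r2_a, r2_b, len(set_a), len(set_a) + len(extra_cols), n)

    observed = f_for(extra)
    exceed = 0
    for _ in range(n_perm):
        order = list(range(n))
        rng.shuffle(order)
        # One reordering of individuals is applied to every SNP in (B - A), so LD
        # within (B - A) is kept while LD between set A and (B - A) is broken.
        shuffled = [[col[i] for i in order] for col in extra]
        if f_for(shuffled) >= observed:
            exceed += 1
    return observed, (exceed + 1) / (n_perm + 1)
```

Because the number of SNPs in set B is the same in every replicate, a small permutation P value points to the LD between (B - A) and the rest of the data, rather than to the number of added SNPs, as the source of the gain in R^2.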

For each rare SNP, we also repeat this comparison of the multiple
correlation R^2_adj with higher thresholds, in which set A is the
common SNP set having LD given by r^2 < 0.95 and set B is the set
having LD given by r^2 < 0.99.

Results 

Correlation between the number of rare alleles and 
common SNPs within a region 

Using all 627 samples, the correlation between the number of rare
alleles in any five randomly selected rare SNPs and a set of
common SNPs within a 1-Mb region is less than 0.1 for both
correlation measures. The correlation between the number of rare
alleles and a set of common SNPs within subpopulations was larger
than that of the samples overall (Table 1; Figure 1). The mean
adjusted multiple correlation R^2_adj for European, Chinese,
Denver Chinese, Japanese, Luhya, Tuscan, and Yoruba ranged from
0.06 to 0.24 (Table 1). In the total sample, the observed
correlations did not differ significantly from random
correlations, which are given by correlations between the number
of rare alleles and a set of randomly permuted common SNPs. In
the subpopulations, however, the correlations between the number
of rare alleles and the set of common SNPs were significantly
different from random correlations (P < 0.001) (Table 1), but the
difference was quite small.

In the total sample, the set of common SNPs has a correlation
with the number of rare synonymous alleles of R^2_adj = 0.057 and
with the number of rare nonsynonymous alleles of R^2_adj = 0.048;
the difference, although small, is significant
(P = 7.74 x 10^-4). In the subpopulations, the set of common SNPs
also showed higher correlations with the number of rare
synonymous alleles than with the number of rare nonsynonymous
alleles,



Table 1 Mean multiple correlation R^2 between (1) the set of common SNPs and the number of rare alleles, (2)
permuted common SNPs and the number of rare alleles, (3) the set of common SNPs and the number of synonymous
rare alleles, and (4) the set of common SNPs and the number of nonsynonymous rare alleles

Population      (1) Common      (2) Random      t test P        Kolmogorov-     (3) Common vs.  (4) Common vs.  t test P
                vs. rare SNPs   correlation                     Smirnov test    synonymous      nonsynonymous
                                                                                rare SNPs       rare SNPs

European        0.078           -0.022          2.84 x 10^-9    9.66 x 10^-15   0.078           0.077           0.952
Chinese         0.067           0.003           6.80 x 10^-6    1.28 x 10^-5    0.090           0.041           0.024
Denver Chinese  0.063           -0.002          2.30 x 10^-6    3.052 x 10^-10  ...             0.064           0.350
Japanese        0.089           0.004           1.50 x 10^-8    6.17 x 10^-12   0.091           0.081           0.668
Luhya           0.238           -0.0006         <2.2 x 10^-16   <2.2 x 10^-16   ...             ...             ...
Tuscan          ...             -0.002          0.001           4.60 x 10^-6    0.088           0.045           ...
Yoruba          0.120           -0.007          <2.2 x 10^-16   <2.2 x 10^-16   0.142           0.099           0.008
All samples     ...             ...             ...             ...             0.057           0.048           7.74 x 10^-4
Figure 1 Distribution of the correlation r^2 between rare alleles and common SNPs in the subpopulations and overall. The correlation is
between the common SNPs and the number of rare alleles present in five random rare SNPs within a 1-Mb region. X-axes are the correlation r^2,
y-axes are the probability densities.
and the difference was most significant in Yoruba (P = 0.008).
Note that in Yoruba, although the average correlation between
common SNPs and the number of rare alleles is not high
(R^2_adj = 0.12), it is significantly different from a random
correlation, which suggests that common SNPs are able to capture
some information on the number of rare alleles. In Yoruba, the
set of common SNPs has a significantly larger correlation with
the number of rare synonymous alleles than with the number of
rare nonsynonymous alleles (P = 0.008), which may indicate that
the common SNPs are more prone to detecting nonfunctional SNPs
than functional SNPs in this population. The correlation between
common SNPs and the number of rare alleles is highest in Luhya
(R^2_adj = 0.24), but common SNPs show no significant difference
in capturing synonymous and nonsynonymous SNPs.

Capability of common SNPs in strong LD to capture rare 
variants within a region 

We compared two correlations: the adjusted multiple correlation
between a rare SNP and the common SNPs in set A (LD of
r^2 < 0.80), and the adjusted multiple correlation between that
rare SNP and the common SNPs in set B (composed of both the SNPs
in set A with LD r^2 < 0.80 and the SNPs in stronger LD,
0.80 ≤ r^2 < 0.95). Some rare SNPs showed higher correlations
with the common SNPs in set B than with those in set A
(Figure 2). The distributions of the two correlations are
significantly different using a Kolmogorov-Smirnov test
(P = 2.44 x 10^-6), although their means are not significantly
different by a t test (P = 0.07). If set A is the set of common
SNPs with LD r^2 < 0.95 and set B is the set of common SNPs with
LD r^2 < 0.99, then set B also shows higher correlation with some
rare SNPs than set A does, and the difference between the
distributions of the two R^2_adj values is significant
(Kolmogorov-Smirnov test P = 0.02). We used the F statistic to
evaluate whether the increase in R^2 for set B is due to the
extra SNPs in stronger LD in set B or is just due to chance. For
the points in Figure 2 that show an increase in R^2_adj greater
than 0.30, most of the increases are significant (nominal
P < 10^-5 using an F test that assumes normality; P < 0.03 by
permutation), except for two points (nominal P > 0.08,
permutation P > 0.11).

Discussion 

In this study, we found that within a region in the gen- 
ome, overall the common SNPs are not highly correlated 




Figure 2 Distribution of the multiple correlation R^2 between a rare SNP and a set of common SNPs within a 1-Mb region of the rare
SNP. Each point represents a rare SNP. The x-axis is the adjusted R^2 (R^2_adj) between the rare SNP and the common SNPs in set A, and the
y-axis is the adjusted R^2 (R^2_adj) between the rare SNP and the common SNPs in set B. SNPs in set B have stronger LD than SNPs in set A, thus
set B contains all the SNPs in set A and the SNPs that have stronger LD with those in set A or between themselves. In the left-hand panel, SNPs
in set A have LD r^2 < 0.8 and SNPs in set B have LD r^2 < 0.95. In the right-hand panel, SNPs in set A have LD r^2 < 0.95 and SNPs in set B have
LD r^2 < 0.99.
with the number of rare alleles, so they are not powerful
for tagging the presence of rare alleles. But in subpopula- 
tions, the common SNPs can capture some information 
on the presence of rare variants, and their increased cor- 
relations are statistically significant but are often small 
(Table 1). We also found that including tagging SNPs in 
strong LD with each other is helpful in detecting rare 
alleles. 

Common SNPs have higher correlations with the pre- 
sence of rare SNPs in the subpopulations, which indi- 
cates that population structure influences the tagging 
power. The common SNPs have lower correlations with 
the presence of nonsynonymous SNPs, especially in the 
Yoruba population, which may indicate difficulty in cap- 
turing rare functional variants in that population. In 
addition to the presence of rare alleles, we also analyzed 
the correlation between common SNPs and another 
variable, a collapsing statistic for rare SNPs [7-9], which 
has the value 1 if a rare allele is present and the value 0 
if no rare alleles are present among several randomly 
selected SNPs within a genome region. We obtained 
similar results with the collapsing variable (data not 
shown). 
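
The collapsing variable described above can be written in a few lines. This sketch assumes rare SNP genotypes are stored as lists of rare-allele counts per individual; the function name is hypothetical and is not taken from the cited collapsing methods [7-9].

```python
def collapsing_indicator(rare_snps, chosen):
    """1 if individual i carries any rare allele at the chosen SNPs, otherwise 0."""
    n_ind = len(rare_snps[0])
    return [
        1 if any(rare_snps[j][i] > 0 for j in chosen) else 0
        for i in range(n_ind)
    ]
```

Unlike the count of rare alleles, this indicator does not distinguish an individual carrying one rare allele from an individual carrying several.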

Our study suggests that we should not exclude SNPs in strong LD
(e.g., r^2 > 0.95) from tagging SNPs in an association study,
because they can help to detect rare SNPs. Rare SNPs are less
helpful for predicting disease risk, however, because their
attributable risk is so small; but the significant associations
detected through them could be important for detecting new
metabolic pathways.

The adjusted multiple correlation R^2_adj could be overadjusted
because the adjustment assumes independence of the common SNPs,
which is not the case for our study. But we nevertheless get
increased R^2_adj to tag rare SNPs by including SNPs in strong LD
with each other among the tagging SNPs, which indicates their
importance in an association study to detect causal variants.
Conclusions

In this study, we found that, overall, common SNPs are 
not good at capturing the presence of rare alleles within 
a region of the genome, but they can capture some 
information on their presence in subpopulations. The 
common SNPs are more prone to capturing nonfunc- 
tional rare SNPs, especially in some populations. We 
also found that including tagging SNPs in strong LD 
with each other can be helpful in detecting rare variants. 